
Paper: Computational Resource Optimisation in Feature Selection under Class Imbalance Conditions #947

Open
wants to merge 122 commits into base: 2024
Conversation

AmadiGabriel (Collaborator) commented Jun 8, 2024

If you are creating this PR in order to submit a draft of your paper, please name your PR with Paper: <title>. An editor will then add a paper label and GitHub Actions will be run to check and build your paper.

See the project readme for more information.

Editor: Chris Calloway @cbcunc

Reviewers:

@ameyxd self-assigned this Jun 8, 2024
@ameyxd added the paper label (indicates that the PR in question is a paper) Jun 8, 2024
@mepa changed the title from "paper: Computational Resource Optimisation in Feature Selection under Class Imbalance Conditions" to "Paper: Computational Resource Optimisation in Feature Selection under Class Imbalance Conditions" Jun 9, 2024
github-actions bot commented Jun 9, 2024

Curvenote Preview

Directory: papers/amadi_udu
Checks: 80 checks passed (4 optional)
Updated (UTC): Jul 10, 2024, 3:11 PM

AmadiGabriel (Collaborator, Author) commented, quoting @apaleyes's review:

Very nicely written paper; the final render (which I was able to fish out of the build log) is beautiful and easy to read.

Many minor comments are below. I also have a general question for the authors: I struggle a bit to understand the main contribution or takeaway from this paper. Let's consider some ideas:

  • Is it the method itself? The method appears to be a combination of several well-established techniques, without too much customisation.
  • Is it the Python implementation of the method? If so, the emphasis should probably be more on the implementation details, software design, and reuse.
  • Is it the analysis of feature selection effects on datasets with class imbalance? If so, the paper needs more unified conclusions and general observations that span beyond a single dataset. It probably also needs to analyse larger datasets.
  • Is it the analysis of the three selected models? If so, the selection of models either needs to be expanded or needs a solid justification (e.g. we used these models because they are the most popular in field X).

I think the paper could be improved a lot if it had this single contribution (or multiple contributions) strongly emphasised, motivated, and supported. Very happy to discuss this question more here, in this PR, as part of the review process.

Thank you @apaleyes for your insightful comments, which have enhanced the content and quality of the paper.
Our ideas relate closely to the latter two questions asked.

We have provided a more unified conclusion, clarifying that the paper is a preliminary study of five datasets with substantial sample sizes, characterised by class imbalance. A justification for the selection of the models has been included in the text; this informed the choice of PFI for the feature selection process, owing to its advantage of being model-agnostic. Expansion to other models and much larger datasets has been noted in the conclusion as further study.
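For context on the model-agnostic point: permutation feature importance only needs a fitted model, a held-out set, and a scorer, so the same call covers Random Forest, LightGBM, or SVM. Below is a minimal sketch using scikit-learn's permutation_importance; the dataset, model, and settings are illustrative, not the paper's.

```python
# Minimal sketch of model-agnostic permutation feature importance (PFI).
# Dataset, model, and settings are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (roughly 9:1 class ratio).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# PFI only requires predictions and a scorer, so the same call works
# unchanged for Random Forest, LightGBM, or SVM estimators.
result = permutation_importance(model, X_test, y_test, scoring="roc_auc",
                                n_repeats=10, random_state=0, n_jobs=-1)
ranked = result.importances_mean.argsort()[::-1]
print("Top features by mean AUC drop:", ranked[:5])
```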


Thank you @apaleyes for the insightful comments on this paper, which have enhanced its quality and richness.
The main contribution of the work borders on the latter two questions you raised. Accordingly, we have provided a more unified conclusion and justified the selection of the models. As this is a preliminary investigation, we have also included as future work an expansion to introduce a quantitative measure of the variability of models and feature selection methods. Other comments have also been addressed.

AmadiGabriel (Collaborator, Author) commented Jul 8, 2024, quoting @janeadams's review:

A succinct and interesting read on evaluating permutation feature importance (PFI) impacts on three different classification models (Random Forest, LightGBM, and SVM) with varying proportions of subsampled data featuring unbalanced classes. I have minor comments, but overall I think this is a great contribution.

  • The dual axes in the processing time figure were odd to me at first; it might be valuable to explain that SVM's poor performance relative to the other two methods is likely due to its poor parallelizability (if that's a word).
  • The "decrease in AUC" figures are confusing in that negative x-axis values must therefore indicate an increase in AUC? (Correct me if I am misunderstanding.) This forces the reader to think about a "double negative makes a positive", which adds possibly unnecessary complexity to interpretation. I would recommend either 1) changing the axis/measure to just be "change in AUC" and/or 2) adding annotations directly onto the white space with an arrow indicating "poorer performance this direction" or similar.

I particularly appreciated the pre-filtering step of using hierarchical clustering of features to account for potential collinearities. I also appreciated that the authors used multiple datasets and evaluated at a range of sample proportions. This is a nice example of how a lot of scientific computing Python libraries can come together into a single interesting experiment.
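For readers unfamiliar with this kind of pre-filtering step, here is a minimal sketch of one way a collinearity pre-filter can look: cluster features on their Spearman correlations and keep one representative per cluster. It uses SciPy/NumPy; the data and the distance threshold are illustrative, not taken from the paper.

```python
# Minimal sketch of a collinearity pre-filter via hierarchical clustering
# of feature correlations. Data and threshold are illustrative only.
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=500)  # inject two collinear features

corr = spearmanr(X).correlation      # feature-by-feature Spearman correlation
corr = (corr + corr.T) / 2           # enforce symmetry
np.fill_diagonal(corr, 1.0)

distance = 1 - np.abs(corr)          # strongly correlated features -> small distance
linkage = hierarchy.ward(squareform(distance, checks=False))
cluster_ids = hierarchy.fcluster(linkage, t=1.0, criterion="distance")

# Keep one representative feature per cluster.
keep = sorted({cid: idx for idx, cid in enumerate(cluster_ids)}.values())
print("selected feature indices:", keep)
```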

Thank you @janeadams for the observations and review of the paper. These have provided clarity to aspects of the data visualisation and improved the deductions on model performance.

An explanation for SVM's poor performance has been included in the text.
The axis has been changed to "change in AUC". More explanation has been included to clarify positive and negative PFI performance results.
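To make the sign convention concrete, here is a minimal, self-contained sketch of reporting "change in AUC" (AUC with the reduced feature set minus AUC with all features), so that a positive value reads directly as an improvement. The dataset, model, and feature subset are illustrative, not the paper's.

```python
# Minimal sketch of the "change in AUC" framing: positive = improvement.
# Dataset, model, and feature subset are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

selected = [0, 3, 5, 7, 11]  # hypothetical subset chosen by feature selection

auc_full = roc_auc_score(
    y_te,
    RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    .predict_proba(X_te)[:, 1])
auc_sub = roc_auc_score(
    y_te,
    RandomForestClassifier(random_state=0).fit(X_tr[:, selected], y_tr)
    .predict_proba(X_te[:, selected])[:, 1])

print(f"change in AUC: {auc_sub - auc_full:+.4f}")  # positive means better
```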

apaleyes (Collaborator) commented Jul 9, 2024

Lovely, thanks for all the work on updating the paper, @AmadiGabriel! I'll have another look shortly.

apaleyes (Collaborator) commented Aug 7, 2024

"shortly" ha-ha (one month later)

Anyhow, @cbcunc @ameyxd, I am happy with the changes made to the paper and with how the comments were addressed. If I am reading it right, some additional experiments were run; quite impressive!

To the authors: I still think the number of features in the datasets reviewed isn't big enough to justify feature selection. It absolutely works as a first step, but it would be nice to see follow-up work on larger datasets, as the future work paragraph promises. Same conference next year?
